Abstract We consider testing independence in group-wise selections with
some restrictions on combinations of choices. We present models for frequency
data of selections for which it is easy to perform conditional tests by Markov
chain Monte Carlo (MCMC) methods. When the restrictions on the combinations
can be described in terms of a Segre-Veronese configuration, an explicit
form of a Gröbner basis consisting of binomials of degree two is readily
available for running a Markov chain. We illustrate our setting with the National
Center Test for university entrance examinations in Japan. We also apply our
method to testing independence hypotheses involving genotypes at more than
one locus or haplotypes of alleles on the same chromosome.
1 Introduction

Suppose that people are asked to select items which are classified into cate- 
gories or groups and there are some restrictions on combinations of choices. For 
example, when a consumer buys a car, he or she can choose various options, 
such as a color, a grade of air conditioning, a brand of audio equipment, etc. 
Due to space restrictions for example, some combinations of options may not 
be available. The problem we consider in this paper is testing independence of 
people's preferences in group-wise selections in the presence of restrictions. We
assume that observations are the counts of people choosing various combinations
in group-wise selections, i.e., the data are given in the form of a multiway
contingency table with some structural zeros corresponding to the restrictions.

If there are m groups of items and a consumer freely chooses just one item 
from each group, then the combination of choices is simply a cell of an m- 
way contingency table. Then the hypothesis of independence reduces to the 
complete independence model of an m-way contingency table. The problem 
becomes harder if there are some additional conditions in a group-wise se- 
lection. A consumer may be asked to choose up to two items from a group 
or there may be a restriction on the total number of items. Groups may be 
nested, so that there are further restrictions on the number of items from sub- 
groups. Some restrictions may concern several groups or subgroups. Therefore 
the restrictions on combinations may be complicated. 

As a concrete example we consider restrictions on choosing subjects in the 
National Center Test (NCT hereafter) for university entrance examinations 
in Japan. Due to time constraints of the schedule of the test, the pattern of
restrictions is rather complicated. However, we will show that the restrictions of
NCT can be described in terms of a Segre-Veronese configuration.

Another important application of this paper is a generalization of the
Hardy-Weinberg model in population genetics. We are interested in testing
various hypotheses of independence involving genotypes at more than one locus
and haplotypes of combinations of alleles on the same chromosome. Although
this problem seems to be different from the above introductory motivation on
consumer choices, we can imagine that each offspring is required to choose two
alleles for each gene (locus) from a pool of alleles for the gene. He or she can
choose the same allele twice (homozygote) or different alleles (heterozygote).
In the Hardy-Weinberg model the two choices are assumed to be independently
and identically distributed. A natural generalization of the Hardy-Weinberg
model for a single locus is to consider independence of genotypes of more than
one locus. In many epidemiological studies, the primary interest is the
correlation between a certain disease and the genotype of a single gene (or the
genotypes at more than one locus, or the haplotypes involving alleles on the
same chromosome). Further complication might arise if certain homozygotes
are fatal and cannot be observed, thus becoming structural zeros.

In this paper we consider conditional tests of independence hypotheses in
the above two important problems from the viewpoint of Markov bases and
Gröbner bases. Evaluation of P-values by the Markov chain Monte Carlo (MCMC)
method using Markov bases and Gröbner bases was initiated in Diaconis and
Sturmfels (1998). See also Sturmfels (1995). Since then, this approach has
attracted much attention from statisticians as well as algebraists. Contributions
of the present authors are found, for example, in Aoki and Takemura (2005,
2007), Ohsugi and Hibi (2005, 2006, 2007), and Takemura and Aoki (2004).
Methods of algebraic statistics are currently actively applied to problems in
computational biology (Pachter and Sturmfels, 2005). In algebraic statistics,
results in commutative algebra may find somewhat unexpected applications
in statistics. At the same time statistical problems may present new problems
to commutative algebra. A recent example is a conjunctive Bayesian network
proposed in Beerenwinkel et al. (2006), where a result of Hibi (1987) is
successfully used. In this paper we present an application of results on
Segre-Veronese configurations to testing independence in NCT and
Hardy-Weinberg models. In fact, these statistical considerations have prompted
further theoretical developments of Gröbner bases for Segre-Veronese type
configurations and we will present these theoretical results in our subsequent
paper (Aoki et al., 2007).

Even in two-way tables, if the positions of the structural zeros are arbitrary,
then Markov bases may contain moves of high degrees (Aoki and Takemura,
2005). See also Huber et al. (2006) and Rapallo (2006) for Markov bases of
problems with structural zeros. However, if the restrictions on the
combinations can be described in terms of a Segre-Veronese configuration,
then an explicit form of a Gröbner basis consisting of binomials of degree
two with a squarefree initial term is readily available for running a Markov
chain for performing conditional tests of various hypotheses of independence.
Therefore models which can be described by a Segre-Veronese configuration
are very useful for statistical analysis.

The organization of this paper is as follows. In Section 2, we introduce
two examples of group-wise selection. In Section 3, we give a formalization
of conditional tests and MCMC procedures and consider various hypotheses
of independence for NCT data and the allele frequency data. In Section 4,
we define the Segre-Veronese configuration. We give an explicit expression of a
reduced Gröbner basis for the configuration and describe a simple procedure
for running MCMC using the basis for conditional tests. In Section 5 we present
numerical results on NCT data and diplotype frequency data. We end the
paper with some discussion in Section 6.



2 Examples of group-wise selections 

In this section, we introduce two examples of group-wise selection. In Section 
2.1, we take a close look at patterns of selections of subjects in NCT. In 
Section 2.2, we illustrate an important problem of population genetics from 
the viewpoint of group-wise selection. 
2.1 The case of National Center Test in Japan

One important example of group-wise selection is the entrance examination 
for universities in Japan. In Japan, as the common first-stage screening pro- 
cess, most students applying for universities take the National Center Test for 
university entrance examinations administered by National Center for Univer- 
sity Entrance Examinations (NCUEE). Basic information in English on NCT 
in 2006 is available from the booklet published by NCUEE ([12] in the refer- 
ences). After obtaining the score of NCT, students apply to departments of 
individual universities and take second-stage examinations administered by the 
universities. Due to time constraints of the schedule of NCT, there are rather
complicated restrictions on possible combinations of subjects. Furthermore each
department of each university can impose different additional requirements on
the combinations of subjects of NCT for students applying to the department.

In NCT examinees can choose subjects in Mathematics, Social Studies 
and Science. These three major subjects are divided into subcategories. For 
example Mathematics is divided into Mathematics 1 and Mathematics 2 and 
these are then composed of individual subjects. In the test carried out in 2006, 
examinees could select two mathematics subjects, two social studies subjects 
and three science subjects at most as shown below. The details of the subjects 
can be found in web pages and publications of NCUEE. In this paper, we 
omit Mathematics for simplicity, and only consider selections in Social Studies 
and Science. In parentheses we show our abbreviations for the subjects in this 
paper. 

— Social Studies: 

o Geography and History: One subject from {World History A (WHA), 
World History B (WHB), Japanese History A (JHA), Japanese History 
B (JHB), Geography A (GeoA), Geography B (GeoB)} 

o Civics: One subject from {Contemporary Society (ContSoc), Ethics, 
Politics and Economics (P&E)} 

— Science: 

o Science 1: One subject from {Comprehensive Science B (CSciB), Biology I
(BioI), Integrated Science (IntegS), Biology IA (BioIA)}

o Science 2: One subject from {Comprehensive Science A (CSciA), Chemistry I
(ChemI), Chemistry IA (ChemIA)}

o Science 3: One subject from {Physics I (PhysI), Earth Science I (EarthI),
Physics IA (PhysIA), Earth Science IA (EarthIA)}

Frequencies of the examinees selecting each combination of subjects in 2006
are given in the website of NCUEE. We reproduce part of them in Tables 8-12
at the end of the paper. As seen in these tables, examinees may select or
not select these subjects. For example, one examinee may select two subjects
from Social Studies and three subjects from Science, while another examinee
may select only one subject from Science and none from Social Studies. Hence
each examinee is categorized into one of the
$(6 + 1) \times (3 + 1) \times (4 + 1) \times (3 + 1) \times (4 + 1) = 2800$
combinations of individual subjects. Here 1 is added for not choosing from
the subcategory. As mentioned above, individual departments of universities
impose different additional requirements on the choices of subjects of NCT.
For example, many science or engineering departments of national universities
ask the students to take two subjects from Science and one subject from Social
Studies.
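
The count of 2800 combinations can be checked by direct enumeration. The following is only an illustrative sketch: the name `SUBGROUPS` and the use of `None` for "no subject chosen from this subgroup" are our own conventions, not part of the NCT data.

```python
from itertools import product

# Subgroup sizes read off from the list of subjects above.
SUBGROUPS = {
    "Geography and History": 6,
    "Civics": 3,
    "Science 1": 4,
    "Science 2": 3,
    "Science 3": 4,
}

# Each subgroup offers its subjects plus one extra option (None),
# meaning that no subject is chosen from that subgroup.
options = [[None] + list(range(1, size + 1)) for size in SUBGROUPS.values()]
combinations = list(product(*options))

print(len(combinations))  # (6+1)(3+1)(4+1)(3+1)(4+1)
```

The enumeration ignores any additional requirements imposed by individual departments; it only reproduces the unrestricted count stated in the text.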

Let us observe some tendencies of the selections by the examinees to
illustrate what kind of statistical questions one might ask concerning the
data in Tables 8-12.

(i) The most frequent triple of Science subjects is {BioI, ChemI, PhysI} in
Table 12, which seems to be consistent with Table 10 since these three subjects
are the most frequently selected subjects in Science 1, Science 2 and
Science 3, respectively. However, in Table 11, while the pairs {BioI, ChemI}
and {ChemI, PhysI} are the most frequently selected pairs in {Science 1,
Science 2} and {Science 2, Science 3}, respectively, the pair {BioI, PhysI} is
not the first choice in {Science 1, Science 3}. This fact indicates differences
in the selection of Science subjects between the examinees selecting two
subjects and those selecting three subjects.

(ii) In Table 9 the most frequent pair is {GeoB, ContSoc}. However the most
frequent single subject from Geography and History is JHB in both Tables
8 and 9. This fact indicates the interaction effect in selecting pairs of Social
Studies.

These observations lead to many interesting statistical questions. However
Tables 8-12 only give frequencies of choices separately for Social Studies and
Science, i.e., they are the marginal tables for these two major subjects. In
this paper we are interested in independence across these two subjects, such
as "are the selections on Social Studies and Science related or not?" We give
various models for NCT data in Section 3.2 and numerical analysis in Section
5.1.

2.2 The case of Hardy-Weinberg models for allele frequency data

We also consider problems of population genetics in this paper. This is another
important application of the methodology of this paper. The allele frequency
data are usually given as the genotype frequency. For a multi-allele locus with
alleles $A_1, A_2, \ldots, A_m$, the probability of the genotype $A_i A_j$ in an individual
from a random breeding population is $q_i^2$ $(i = j)$ or $2 q_i q_j$ $(i \neq j)$, where $q_i$ is
the proportion of the allele $A_i$. These are known as the Hardy-Weinberg
equilibrium probabilities. Since the Hardy-Weinberg law plays an important role
in the field of population genetics and often serves as a basis for genetic
inference, much attention has been paid to tests of the hypothesis that a population
being sampled is in the Hardy-Weinberg equilibrium against the hypothesis
that disturbing forces cause some deviation from the Hardy-Weinberg ratio.
See Crow (1988) and Guo and Thompson (1992) for example. Though Guo
and Thompson (1992) consider the exact test of the Hardy-Weinberg
equilibrium for multiple loci, the exact procedure becomes infeasible if the data size
or the number of alleles is moderately large. Therefore MCMC is also useful
for this problem. Takemura and Aoki (2004) consider conditional tests of
the Hardy-Weinberg model by using MCMC and the technique of Markov bases.
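
As a small numerical illustration of the Hardy-Weinberg equilibrium probabilities described above, the following sketch tabulates the genotype probabilities for a single locus. The function name `hardy_weinberg` and the allele proportions in `q` are hypothetical choices made only for this example.

```python
from itertools import combinations_with_replacement

def hardy_weinberg(q):
    """Return {(i, j): probability} for unordered genotypes A_i A_j with i <= j.

    q[i] is the proportion of allele A_i (0-based index here).
    """
    probs = {}
    for i, j in combinations_with_replacement(range(len(q)), 2):
        # q_i^2 for a homozygote, 2 q_i q_j for a heterozygote.
        probs[(i, j)] = q[i] ** 2 if i == j else 2 * q[i] * q[j]
    return probs

q = [0.5, 0.3, 0.2]          # hypothetical allele proportions, summing to 1
hw = hardy_weinberg(q)
print(sum(hw.values()))      # the genotype probabilities sum to one (up to rounding)
```

With $m$ alleles there are $m(m+1)/2$ unordered genotypes, which is why the table `hw` has six entries for three alleles.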

Due to the rapid progress of sequencing technology, more and more infor- 
mation is available on the combination of alleles on the same chromosome. A 
combination of alleles at more than one locus on the same chromosome is called 
a haplotype and data on haplotype counts are called haplotype frequency data.
Haplotype analysis has gained increasing attention in the mapping of
complex-disease genes, because of the limited power of conventional single-
locus analyses. Haplotype data may come with or without pairing information
on homologous chromosomes. It is technically more difficult to determine pairs
of haplotypes of the corresponding loci on a pair of homologous chromosomes.
A pair of haplotypes on homologous chromosomes is called a diplotype. In 
this paper we are interested in diplotype frequency data, because haplotype 
frequency data on individual chromosomes without pairing information are 
standard contingency table data and can be analyzed by statistical methods 
for usual contingency tables. For the diplotype frequency data, the null model 
we want to consider is the independence model that the probability for each 
diplotype is expressed by the product of probabilities for each genotype. 

We consider the models for genotype frequency data in Section 3.3.1 and 
then consider the models for diplotype frequency data in Section 3.3.2. Note 
that the availability of haplotype data or diplotype data requires a separate 
treatment in our arguments. Finally we give numerical examples of the analysis 
of diplotype frequencies data in Section 5.2. 



3 Conditional tests and models 

3.1 General formulation of conditional tests and Markov chain Monte Carlo 
procedures 

First we give a brief review on performing MCMC for conducting conditional
tests based on the theory of Markov bases. Markov bases were introduced by
Diaconis and Sturmfels (1998) and there are now many references on the
definition and the use of Markov bases (e.g. Aoki and Takemura, 2006).

We denote the space of possible selections as $\mathcal{I}$. Each element $\boldsymbol{i}$ in $\mathcal{I}$
represents a combination of choices. Following the terminology of contingency
tables, each $\boldsymbol{i} \in \mathcal{I}$ is called a cell. It should be noted that unlike the case of
standard multiway contingency tables, our index set $\mathcal{I}$ cannot be written as
a direct product in general. We show the structures of $\mathcal{I}$ for NCT data and
allele frequency data in Section 3.2 and Section 3.3, respectively.

Let $p(\boldsymbol{i})$ denote the probability of selecting the combination $\boldsymbol{i}$ (or the
probability of cell $\boldsymbol{i}$) and write $p = \{p(\boldsymbol{i})\}_{\boldsymbol{i} \in \mathcal{I}}$. In this paper, we do not necessarily
assume that $p$ is normalized. In fact, in the models we consider in this
paper, we only give an unnormalized functional specification of $p(\cdot)$. Note that
we need not calculate the normalizing constant $\sum_{\boldsymbol{i} \in \mathcal{I}} p(\boldsymbol{i})$ for performing an
MCMC procedure. Denote the result of the selections by $n$ individuals as
$x = \{x(\boldsymbol{i})\}_{\boldsymbol{i} \in \mathcal{I}}$, where $x(\boldsymbol{i})$ is the frequency of the cell $\boldsymbol{i}$. We call $x$ a frequency
vector.

In the models considered in this paper, the cell probability $p(\boldsymbol{i})$ is written as
some product of functions, which correspond to various marginal probabilities.
Let $\mathcal{J}$ denote the index set of the marginals. Then our models can be written
as

$$p(\boldsymbol{i}) = h(\boldsymbol{i}) \prod_{j \in \mathcal{J}} g(j)^{a_{j\boldsymbol{i}}}, \tag{1}$$

where $h(\boldsymbol{i})$ is a known function and the $g(j)$'s are the parameters. An important
point here is that the sufficient statistic $t = \{t(j), j \in \mathcal{J}\}$ is written in a matrix
form as

$$t = Ax, \qquad A = (a_{j\boldsymbol{i}})_{j \in \mathcal{J},\, \boldsymbol{i} \in \mathcal{I}}, \tag{2}$$

where $A$ is a $d \times \nu$ matrix of non-negative integers and $d = |\mathcal{J}|$, $\nu = |\mathcal{I}|$. We call
$A$ a configuration in connection with the theory of toric ideals in Section 4.
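
To make the relation $t = Ax$ in (2) concrete, here is a minimal sketch for the simplest possible case, a $2 \times 2$ table under the independence model, where the marginals are the row and column totals. The cell ordering, the labels in `marginals` and the frequency vector `x` are illustrative assumptions, not data from the paper.

```python
# Cells of a 2 x 2 table, listed in a fixed order (the columns of A).
cells = [(1, 1), (1, 2), (2, 1), (2, 2)]
# Marginals (the rows of A): row totals 1, 2, then column totals 1, 2.
marginals = [("row", 1), ("row", 2), ("col", 1), ("col", 2)]

def entry(marginal, cell):
    """a_{ji}: 1 if the cell contributes to the marginal, 0 otherwise."""
    kind, level = marginal
    return int(cell[0 if kind == "row" else 1] == level)

A = [[entry(m, c) for c in cells] for m in marginals]

x = [3, 1, 2, 4]  # hypothetical frequency vector, one count per cell
t = [sum(a * v for a, v in zip(row, x)) for row in A]
print(A)
print(t)
```

Here `t` collects the two row sums followed by the two column sums of the table with counts `x`.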

By the standard theory of conditional tests (Lehmann and Romano, 2005,
for example), we can perform a conditional test of the model (1) based on the
conditional distribution given the sufficient statistic $t$. The conditional sample
space given $t$, called the $t$-fiber, is

$$\mathcal{F}_t = \{x \in \mathbb{N}^{\nu} \mid t = Ax\},$$

where $\mathbb{N} = \{0, 1, \ldots\}$. If we can sample from the conditional distribution
over $\mathcal{F}_t$, we can evaluate P-values of any test statistic. One of the advantages
of the MCMC method of sampling is that it can be run without evaluating the
normalizing constant. Also once a connected Markov chain over the conditional
sample space is constructed, then the chain can be modified to give a connected
and aperiodic Markov chain with the stationary distribution by the Metropolis-
Hastings procedure (e.g. Hastings, 1970). Therefore it is essential to construct
a connected chain and the solution to this problem is given by the notion of
Markov basis (Diaconis and Sturmfels, 1998).
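
The following is a minimal sketch of a Metropolis-Hastings chain on a fiber, under stated assumptions: a toy $2 \times 2$ table, the single well-known basic move for that table as the Markov basis, multinomial sampling, and $h \equiv 1$, so that the conditional probability of $x$ given $t$ is proportional to $1 / \prod_{\boldsymbol{i}} x(\boldsymbol{i})!$. The names `MOVES`, `weight` and `mh_step` are hypothetical and not taken from the paper.

```python
import math
import random

# Toy configuration: row and column totals of a 2 x 2 table.
A = [[1, 1, 0, 0],   # row 1 total
     [0, 0, 1, 1],   # row 2 total
     [1, 0, 1, 0],   # column 1 total
     [0, 1, 0, 1]]   # column 2 total
MOVES = [[1, -1, -1, 1]]  # a Markov basis for this A

def sufficient_statistic(x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def weight(x):
    # Unnormalized conditional probability given t (assumes h = 1).
    return 1.0 / math.prod(math.factorial(v) for v in x)

def mh_step(x, rng):
    """Propose x + z or x - z for a random move z; accept by Metropolis-Hastings."""
    z = rng.choice(MOVES)
    sign = rng.choice((1, -1))
    y = [v + sign * dz for v, dz in zip(x, z)]
    if min(y) < 0:            # the proposal leaves the fiber: stay at x
        return x
    accept = min(1.0, weight(y) / weight(x))
    return y if rng.random() < accept else x

rng = random.Random(0)
x0 = [3, 1, 2, 4]
x, visited = x0, set()
for _ in range(1000):
    x = mh_step(x, rng)
    visited.add(tuple(x))
```

Only the ratio `weight(y) / weight(x)` is used, which is why the normalizing constant never has to be computed. Every visited state has the same sufficient statistic as `x0`.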

The fundamental contribution of Diaconis and Sturmfels (1998) is to show
that a Markov basis is given as a binomial generator of the well-specified
polynomial ideal (toric ideal) and it can be given as a Gröbner basis. In Section
4, we show that our problems considered in Sections 3.2 and 3.3 correspond to a
well-known toric ideal and give an explicit form of the reduced Gröbner basis.

3.2 Models for NCT data 

Following the general formalization in Section 3.1, we formulate data types
and their statistical models in view of NCT. Suppose that there are $J$ different
groups (or categories) and $m_j$ different subgroups in group $j$ for $j = 1, \ldots, J$.
There are $m_{jk}$ different items in subgroup $k$ of group $j$ ($k = 1, \ldots, m_j$,
$j = 1, \ldots, J$). In NCT, $J = 2$, $m_1 = |\{\text{Geography and History, Civics}\}| = 2$ and
similarly $m_2 = 3$. The sizes of subgroups are $m_{11} = |\{\text{WHA, WHB, JHA, JHB,
GeoA, GeoB}\}| = 6$ and similarly $m_{12} = 3$, $m_{21} = 4$, $m_{22} = 3$, $m_{23} = 4$.

Each individual selects $c_{jk}$ items from the subgroup $k$ of group $j$. We
assume that the total number $\tau$ of items chosen is fixed and common for all
individuals. In NCT $c_{jk}$ is either 0 or 1. For example if an examinee is required
to take two Science subjects in NCT, then $(c_{21}, c_{22}, c_{23})$ is $(1, 1, 0)$, $(1, 0, 1)$ or
$(0, 1, 1)$. For the analysis of genotypes in Section 3.3, $c_{jk} = 2$ although there is
no nesting of subgroups, and the same item (allele) can be selected more than
once (selection "with replacement").

We now set up our notation for indexing a combination of choices somewhat
carefully. In NCT, if an examinee chooses WHA from "Geography and
History" of Social Studies and PhysI from Science 3 of Science, we denote the
combination of these two choices as (111)(231). In this notation, the selection
of $c_{jk}$ items from the subgroup $k$ of group $j$ is indexed as

$$\boldsymbol{i}_{jk} = (jkl_1)(jkl_2) \cdots (jkl_{c_{jk}}), \qquad 1 \le l_1 \le \cdots \le l_{c_{jk}} \le m_{jk}.$$

Here $\boldsymbol{i}_{jk}$ is regarded as a string. If nothing is selected from the subgroup, we
define $\boldsymbol{i}_{jk}$ to be an empty string. Now by concatenation of strings, the set $\mathcal{I}$
of combinations is written as

$$\mathcal{I} = \{\boldsymbol{i} = \boldsymbol{i}_1 \cdots \boldsymbol{i}_J\}, \qquad \boldsymbol{i}_j = \boldsymbol{i}_{j1} \cdots \boldsymbol{i}_{j m_j}, \quad j = 1, \ldots, J.$$

For example the choice of (P&E, BioI, ChemI) in NCT is denoted by $\boldsymbol{i}$ =
(123)(212)(222). In the following we denote $\boldsymbol{i}' \subset \boldsymbol{i}$ if $\boldsymbol{i}'$ appears as a substring
of $\boldsymbol{i}$.
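
The indexing convention above can be sketched in a few lines. This is an illustration only: the dictionary `NCT` and the function `encode` are hypothetical names, and the item numbers $l$ are assumed to follow the order in which the subjects are listed in Section 2.1.

```python
# (group j, subgroup k) -> subjects, in the order listed in Section 2.1,
# so that the position in each list is the item number l.
NCT = {
    (1, 1): ["WHA", "WHB", "JHA", "JHB", "GeoA", "GeoB"],
    (1, 2): ["ContSoc", "Ethics", "P&E"],
    (2, 1): ["CSciB", "BioI", "IntegS", "BioIA"],
    (2, 2): ["CSciA", "ChemI", "ChemIA"],
    (2, 3): ["PhysI", "EarthI", "PhysIA", "EarthIA"],
}

def encode(chosen):
    """Return the string index of a set of chosen subjects, e.g. '(111)(231)'."""
    triples = []
    for (j, k), items in sorted(NCT.items()):
        for l, name in enumerate(items, start=1):
            if name in chosen:
                triples.append((j, k, l))
    return "".join(f"({j}{k}{l})" for j, k, l in triples)

print(encode({"WHA", "PhysI"}))          # (111)(231)
print(encode({"P&E", "BioI", "ChemI"}))  # (123)(212)(222)
```

With this representation, the substring relation $\boldsymbol{i}' \subset \boldsymbol{i}$ corresponds to Python's `in` operator on the encoded strings.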

Now we consider some statistical models for $p$. For NCT data, we consider
three simple statistical models, namely, the complete independence model, the
subgroup-wise independence model and the group-wise independence model. The
complete independence model is defined as

$$p(\boldsymbol{i}) = \prod_{j=1}^{J} \prod_{k=1}^{m_j} \prod_{t=1}^{c_{jk}} q_{jk}(l_t) \tag{3}$$

for some parameters $q_{jk}(l)$, $j = 1, \ldots, J$; $k = 1, \ldots, m_j$; $l = 1, \ldots, m_{jk}$. Note
that if $c_{jk} > 1$ we need a multinomial coefficient in (3). The complete
independence model means that each $p(\boldsymbol{i})$, the inclination of the combination $\boldsymbol{i}$, is
explained by the set of inclinations $q_{jk}(l)$ of each item. Here $q_{jk}(l)$ corresponds
to the marginal probability of the item $(jkl)$. However we do not necessarily
normalize them as $1 = \sum_{l=1}^{m_{jk}} q_{jk}(l)$, because the normalization for $p$ is not
trivial anyway. The same comment applies to other models below.
Similarly, the subgroup-wise independence model is defined as

$$p(\boldsymbol{i}) = \prod_{j=1}^{J} \prod_{k=1}^{m_j} q_{jk}(\boldsymbol{i}_{jk}) \tag{4}$$

for some parameters $q_{jk}(\cdot)$, and the group-wise independence model is defined
as

$$p(\boldsymbol{i}) = \prod_{j=1}^{J} q_j(\boldsymbol{i}_j) \tag{5}$$

for some parameters $q_j(\cdot)$.

In this paper, we treat these models as the null models and give testing 
procedures to assess their fitting to observed data following the general theory 
in Section 3.1. 
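
To illustrate how these null models differ, the sketch below computes the sufficient statistics of the complete independence model (3) and the group-wise independence model (5) from a small table of frequencies. Everything here is an assumption made for illustration: cells are represented as tuples of `(j, k, l)` triples, the frequencies in `freq` are invented, and the example only covers cells in which something is chosen from every group.

```python
from collections import Counter

def complete_stat(freq):
    """Item counts: the sufficient statistic of the complete independence model (3)."""
    t = Counter()
    for cell, count in freq.items():
        for item in cell:
            t[item] += count
    return t

def groupwise_stat(freq):
    """Counts of each within-group string i_j: sufficient statistic of model (5)."""
    t = Counter()
    for cell, count in freq.items():
        for j in sorted({item[0] for item in cell}):
            t[tuple(item for item in cell if item[0] == j)] += count
    return t

# Hypothetical frequencies: one Social Studies and two Science subjects per cell.
freq = {
    ((1, 1, 1), (2, 1, 2), (2, 2, 2)): 5,
    ((1, 1, 1), (2, 1, 2), (2, 3, 1)): 3,
    ((1, 2, 3), (2, 1, 2), (2, 2, 2)): 2,
}
print(complete_stat(freq))
print(groupwise_stat(freq))
```

The group-wise statistic keeps the joint counts of Science pairs, whereas the complete independence statistic only keeps the count of each individual subject.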



3.3 Models for allele frequency data 

3.3.1 Models for the genotype frequency data 

We assume that there are $J$ distinct loci. In the locus $j$, there are $m_j$
distinct alleles $A_{j1}, \ldots, A_{j m_j}$. In this case, we can imagine that each individual
selects two alleles for each locus with replacement. Therefore the set of the
combinations is written as

$$\mathcal{I} = \{\boldsymbol{i} = (i_{11} i_{12})(i_{21} i_{22}) \cdots (i_{J1} i_{J2}) \mid 1 \le i_{j1} \le i_{j2} \le m_j,\ j = 1, \ldots, J\}.$$

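For the genotype frequency data, we consider two models of hierarchical
structure, namely, the genotype-wise independence model

$$p(\boldsymbol{i}) = \prod_{j=1}^{J} q_j(i_{j1} i_{j2}) \tag{6}$$

and the Hardy-Weinberg model

$$p(\boldsymbol{i}) = \prod_{j=1}^{J} \tilde{q}_j(i_{j1} i_{j2}), \tag{7}$$

where

$$\tilde{q}_j(i_{j1} i_{j2}) = \begin{cases} q_j(i_{j1})^2 & \text{if } i_{j1} = i_{j2}, \\ 2 q_j(i_{j1}) q_j(i_{j2}) & \text{if } i_{j1} \neq i_{j2}. \end{cases} \tag{8}$$

Note that for both cases the sufficient statistic $t$ can be written as $t = Ax$ for
an appropriate matrix $A$ as shown in Section 5.2.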

3.3.2 Models for the diplotype frequency data 

In order to illustrate the difference between genotype data and diplotype data,
consider a simple case of $J = 2$, $m_1 = m_2 = 2$ and suppose that genotypes of
$n = 4$ individuals are given as

$$\{A_{11}A_{11}, A_{21}A_{21}\},\ \{A_{11}A_{11}, A_{21}A_{22}\},\ \{A_{11}A_{12}, A_{21}A_{21}\},\ \{A_{11}A_{12}, A_{21}A_{22}\}.$$

In this genotype data, for an individual who has a homozygote genotype on at
least one locus, the diplotypes are uniquely determined. However, for the fourth
individual who has the genotype $\{A_{11}A_{12}, A_{21}A_{22}\}$, there are two possible
diplotypes, $\{(A_{11}, A_{21}), (A_{12}, A_{22})\}$ and $\{(A_{11}, A_{22}), (A_{12}, A_{21})\}$.
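
The ambiguity described above can be reproduced by enumerating all phasings of a genotype. The sketch below is illustrative; the function name `diplotypes` and the string labels for alleles are our own assumptions.

```python
from itertools import product

def diplotypes(genotype):
    """Return the set of unordered haplotype pairs compatible with a genotype.

    genotype: list of (a, b) allele pairs, one pair per locus.
    """
    result = set()
    first, rest = genotype[0], genotype[1:]
    # Fix the phase at the first locus; each later locus may be flipped or not.
    for flips in product((False, True), repeat=len(rest)):
        h1, h2 = [first[0]], [first[1]]
        for (a, b), flip in zip(rest, flips):
            h1.append(b if flip else a)
            h2.append(a if flip else b)
        # Sort the two haplotypes so that the pair is unordered.
        result.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return result

fourth = [("A11", "A12"), ("A21", "A22")]   # heterozygous at both loci
second = [("A11", "A11"), ("A21", "A22")]   # homozygous at the first locus
print(diplotypes(fourth))
print(diplotypes(second))
```

For the doubly heterozygous individual two diplotypes are returned, while a homozygote at either locus leaves a single diplotype.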

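Now suppose that information on diplotypes is available. The set of
combinations for the diplotype data is given as

$$\mathcal{I} = \{\boldsymbol{i} = \boldsymbol{i}_1 \boldsymbol{i}_2 = (i_{11} \cdots i_{J1})(i_{12} \cdots i_{J2}) \mid 1 \le i_{j1}, i_{j2} \le m_j,\ j = 1, \ldots, J\}.$$

In order to determine the order of $\boldsymbol{i}_1 = (i_{11} \ldots i_{J1})$ and $\boldsymbol{i}_2 = (i_{12} \ldots i_{J2})$
uniquely, we assume that these two are lexicographically ordered, i.e., there
exists some $j$ such that

$$i_{11} = i_{12},\ \ldots,\ i_{j-1,1} = i_{j-1,2},\ i_{j1} < i_{j2}$$

unless $\boldsymbol{i}_1 = \boldsymbol{i}_2$.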

For the parameter $p = \{p(\boldsymbol{i})\}$ where $p(\boldsymbol{i})$ is the probability for the diplotype
$\boldsymbol{i}$, we can consider the same models as for the genotype case. Corresponding to
the null hypothesis that diplotype data do not contain more information than
the genotype data, we can consider the genotype-wise independence model (6)
and the Hardy-Weinberg model (7). The sufficient statistics for these models
are the same as in the previous subsection.

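If these models are rejected, we can further test independence in diplotype
data. For example we can consider a haplotype-wise Hardy-Weinberg model,

$$p(\boldsymbol{i}) = p(\boldsymbol{i}_1 \boldsymbol{i}_2) = \begin{cases} q(\boldsymbol{i}_1)^2 & \text{if } \boldsymbol{i}_1 = \boldsymbol{i}_2, \\ 2 q(\boldsymbol{i}_1) q(\boldsymbol{i}_2) & \text{if } \boldsymbol{i}_1 \neq \boldsymbol{i}_2. \end{cases}$$

The sufficient statistic for this model is given by the set of frequencies of each
haplotype and the conditional test can be performed as in the case of the
Hardy-Weinberg model for a single gene by formally identifying each haplotype as an
allele.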

4 Gröbner basis for Segre-Veronese configuration

In this section, we introduce toric ideals of algebras of Segre-Veronese type
(Ohsugi and Hibi, 2000) with a generalization to fit statistical applications in
the present paper.

First we define toric ideals. A configuration in $\mathbb{R}^d$ is a finite set $A =
\{a_1, \ldots, a_\nu\} \subset \mathbb{N}^d$. $A$ can be regarded as a $d \times \nu$ matrix and corresponds
to the matrix connecting the frequency vector to the sufficient statistic as in
(2). Let $K$ be a field and $K[q] = K[q_1, \ldots, q_d]$ the polynomial ring in $d$
variables over $K$. We associate a configuration $A \subset \mathbb{N}^d$ with the semigroup ring
$K[A] = K[q^{a_1}, \ldots, q^{a_\nu}]$ where $q^{a} = q_1^{a_1} \cdots q_d^{a_d}$ if $a = (a_1, \ldots, a_d)$. Note that
$d = |\mathcal{J}|$ and $q^{a_i}$ corresponds to the term $\prod_{j \in \mathcal{J}} g(j)^{a_{ji}}$ on the right-hand
side of (1). Let $K[W] = K[w_1, \ldots, w_\nu]$ be the polynomial ring in $\nu$ variables
over $K$. Here $\nu = |\mathcal{I}|$ and the variables $w_1, \ldots, w_\nu$ correspond to the cells
of $\mathcal{I}$. The toric ideal $I_A$ of $A$ is the kernel of the surjective homomorphism
$\pi : K[W] \to K[A]$ defined by setting $\pi(w_i) = q^{a_i}$ for all $1 \le i \le \nu$. It is
known that the toric ideal $I_A$ is generated by the binomials $u - v$, where $u$ and
$v$ are monomials of $K[W]$, with $\pi(u) = \pi(v)$. More precisely, $I_A$ is written as

$$I_A = \left\langle W^{z^+} - W^{z^-} \;\middle|\; z \in \mathbb{Z}^\nu,\ Az = 0 \right\rangle,$$

where $z = z^+ - z^-$ with $z^+, z^- \in \mathbb{N}^\nu$. We call an integer vector $z \in \mathbb{Z}^\nu$ a move
if $Az = 0$.
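
The correspondence between a move $z$ and the binomial $W^{z^+} - W^{z^-}$ can be sketched as follows. The helper names `split_move`, `is_move`, `monomial` and `binomial`, and the $2 \times 2$ example configuration, are assumptions made for illustration only.

```python
def split_move(z):
    """Return (z_plus, z_minus) with z = z_plus - z_minus, both nonnegative."""
    return [max(v, 0) for v in z], [max(-v, 0) for v in z]

def is_move(A, z):
    """True if Az = 0."""
    return all(sum(a * v for a, v in zip(row, z)) == 0 for row in A)

def monomial(exponents):
    """Text form of W^e, e.g. [1, 0, 0, 1] -> 'w1*w4'."""
    factors = [f"w{i}" if e == 1 else f"w{i}^{e}"
               for i, e in enumerate(exponents, start=1) if e > 0]
    return "*".join(factors) if factors else "1"

def binomial(z):
    z_plus, z_minus = split_move(z)
    return f"{monomial(z_plus)} - {monomial(z_minus)}"

# Row and column totals of a 2 x 2 table, cells ordered (1,1), (1,2), (2,1), (2,2).
A = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0], [0, 1, 0, 1]]
z = [1, -1, -1, 1]
print(is_move(A, z), binomial(z))
```

Adding `z` to a frequency vector leaves $t = Ax$ unchanged, which is exactly the condition $Az = 0$.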

The initial ideal of $I_A$ with respect to a monomial order is the ideal of
$K[W]$ generated by all initial monomials of nonzero elements of $I_A$. A finite
subset $\mathcal{G}$ of $I_A$ is called a Gröbner basis of $I_A$ with respect to a monomial order $<$
if the initial ideal of $I_A$ with respect to $<$ is generated by the initial monomials
of the polynomials in $\mathcal{G}$. A Gröbner basis $\mathcal{G}$ is called reduced if, for each $g \in \mathcal{G}$,
none of the monomials in $g$ is divisible by the initial monomial of $g'$ for
some $g \neq g' \in \mathcal{G}$. It is known that if $\mathcal{G}$ is a Gröbner basis of $I_A$, then $I_A$ is
generated by $\mathcal{G}$. In general, the reduced Gröbner basis of a toric ideal consists
of binomials. See Chapter 4 of Sturmfels (1995) for the details of toric ideals
and Gröbner bases.

The following proposition associates Markov bases with toric ideals. 

Proposition 1 (Diaconis and Sturmfels, 1998) A set of moves $\mathcal{B} = \{z_1, \ldots, z_L\}$
is a Markov basis if and only if $I_A$ is generated by the binomials
$W^{z_1^+} - W^{z_1^-}, \ldots, W^{z_L^+} - W^{z_L^-}$.

We now introduce the notion of algebras of Segre-Veronese type. Fix integers
$\tau \ge 2$, $M \ge 1$ and sets of integers $b = \{b_1, \ldots, b_M\}$, $c = \{c_1, \ldots, c_M\}$,
$r = \{r_1, \ldots, r_M\}$ and $s = \{s_1, \ldots, s_M\}$ such that

(i) $0 \le c_i \le b_i$ for all $1 \le i \le M$;

(ii) $1 \le s_i \le r_i \le d$ for all $1 \le i \le M$.

Let $A_{\tau,b,c,r,s} \subset \mathbb{N}^d$ denote the configuration consisting of all nonnegative
integer vectors $(f_1, f_2, \ldots, f_d) \in \mathbb{N}^d$ such that

(i) $\sum_{j=1}^{d} f_j = \tau$.

(ii) $c_i \le \sum_{j=s_i}^{r_i} f_j \le b_i$ for all $1 \le i \le M$.

Let $K[A_{\tau,b,c,r,s}]$ denote the affine semigroup ring generated by all monomials
$\prod_{j=1}^{d} q_j^{f_j}$ over $K$ and call it an algebra of Segre-Veronese type. Note that the
present definition generalizes the definition in Ohsugi and Hibi (2000).

Several popular classes of semigroup rings are algebras of Segre-Veronese
type. If $M = 2$, $\tau = 2$, $b_1 = b_2 = c_1 = c_2 = 1$, $s_1 = 1$, $s_2 = r_1 + 1$
and $r_2 = d$, then the affine semigroup ring $K[A_{\tau,b,c,r,s}]$ is the Segre product
of polynomial rings $K[q_1, \ldots, q_{r_1}]$ and $K[q_{r_1+1}, \ldots, q_d]$. On the other hand,
if $M = d$, $s_i = r_i = i$, $b_i = \tau$ and $c_i = 0$ for all $1 \le i \le M$, then the
affine semigroup ring $K[A_{\tau,b,c,r,s}]$ is the classical $\tau$th Veronese subring of the
polynomial ring $K[q_1, \ldots, q_d]$. Moreover, if $M = d$, $s_i = r_i = i$, $b_i = 1$ and
$c_i = 0$ for all $1 \le i \le M$, then the affine semigroup ring $K[A_{\tau,b,c,r,s}]$ is
the $\tau$th squarefree Veronese subring of the polynomial ring $K[q_1, \ldots, q_d]$. In
addition, algebras of Veronese type (i.e., $M = d$, $s_i = r_i = i$ and $c_i = 0$ for all
$1 \le i \le M$) are studied in De Negri and Hibi (1997) and Sturmfels (1995).
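
The definition of a configuration of Segre-Veronese type can be checked on small cases by brute-force enumeration. The sketch below is an illustration under our own naming (`segre_veronese` is a hypothetical helper); it lists all vectors $(f_1, \ldots, f_d)$ satisfying conditions (i) and (ii) and counts them for three of the special cases mentioned above.

```python
from itertools import combinations_with_replacement

def segre_veronese(d, tau, b, c, r, s):
    """All f in N^d with f_1 + ... + f_d = tau and
    c_i <= f_{s_i} + ... + f_{r_i} <= b_i for every i (1-based s_i, r_i)."""
    columns = []
    for idx in combinations_with_replacement(range(1, d + 1), tau):
        f = [idx.count(j) for j in range(1, d + 1)]
        if all(c_i <= sum(f[s_i - 1:r_i]) <= b_i
               for b_i, c_i, r_i, s_i in zip(b, c, r, s)):
            columns.append(tuple(f))
    return columns

# Segre product of K[q1, q2] and K[q3, q4, q5]: one index from each block.
segre = segre_veronese(d=5, tau=2, b=(1, 1), c=(1, 1), r=(2, 5), s=(1, 3))
# Second Veronese subring of K[q1, q2, q3].
veronese = segre_veronese(d=3, tau=2, b=(2, 2, 2), c=(0, 0, 0), r=(1, 2, 3), s=(1, 2, 3))
# Second squarefree Veronese subring of K[q1, q2, q3].
squarefree = segre_veronese(d=3, tau=2, b=(1, 1, 1), c=(0, 0, 0), r=(1, 2, 3), s=(1, 2, 3))
print(len(segre), len(veronese), len(squarefree))
```

The brute-force enumeration is only practical for small $d$ and $\tau$; it is meant to make the constraints tangible, not to be efficient.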
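Let $K[Y]$ denote the polynomial ring with the set of variables

$$\left\{ y_{j_1 j_2 \cdots j_\tau} \;\middle|\; 1 \le j_1 \le j_2 \le \cdots \le j_\tau \le d,\ \prod_{k=1}^{\tau} q_{j_k} \in \{q^{a_1}, \ldots, q^{a_\nu}\} \right\},$$

where $K[A_{\tau,b,c,r,s}] = K[q^{a_1}, \ldots, q^{a_\nu}]$. The toric ideal $I_{A_{\tau,b,c,r,s}}$ is the
kernel of the surjective homomorphism $\pi : K[Y] \to K[A_{\tau,b,c,r,s}]$ defined by
$\pi(y_{j_1 j_2 \cdots j_\tau}) = \prod_{k=1}^{\tau} q_{j_k}$.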

A monomial $y_{\alpha_1 \alpha_2 \cdots \alpha_\tau} y_{\beta_1 \beta_2 \cdots \beta_\tau} \cdots y_{\gamma_1 \gamma_2 \cdots \gamma_\tau}$ is called sorted if
$\alpha_1 \le \beta_1 \le \cdots \le \gamma_1 \le \alpha_2 \le \beta_2 \le \cdots \le \gamma_2 \le \cdots \le \alpha_\tau \le \beta_\tau \le \cdots \le \gamma_\tau$.

Let sort($\cdot$) denote the operator which takes any string over the alphabet
$\{1, 2, \ldots, d\}$ and sorts it into weakly increasing order. Then the quadratic
Gröbner basis of the toric ideal $I_{A_{\tau,b,c,r,s}}$ is given as follows.

Theorem 1 Work with the same notation as above. Then there exists a
monomial order on $K[Y]$ such that the set of all binomials

$$\left\{ y_{\alpha_1 \alpha_2 \cdots \alpha_\tau} y_{\beta_1 \beta_2 \cdots \beta_\tau} - y_{\gamma_1 \gamma_3 \cdots \gamma_{2\tau-1}} y_{\gamma_2 \gamma_4 \cdots \gamma_{2\tau}} \;\middle|\; \mathrm{sort}(\alpha_1 \beta_1 \alpha_2 \beta_2 \cdots \alpha_\tau \beta_\tau) = \gamma_1 \gamma_2 \cdots \gamma_{2\tau} \right\} \tag{9}$$

is the reduced Gröbner basis of the toric ideal $I_{A_{\tau,b,c,r,s}}$. The initial ideal is
generated by squarefree quadratic (nonsorted) monomials.

In particular, the set of all integer vectors corresponding to the above
binomials is a Markov basis. Furthermore the set is minimal as a Markov basis.

Proof. The basic idea of the proof appears in Theorem 14.2 in Sturmfels (1995).

Let $\mathcal{G}$ be the above set of binomials. First we show that $\mathcal{G} \subset I_{A_{\tau,b,c,r,s}}$.
Suppose that $m = y_{\alpha_1 \alpha_2 \cdots \alpha_\tau} y_{\beta_1 \beta_2 \cdots \beta_\tau}$ is not sorted and let

$$\gamma_1 \gamma_2 \cdots \gamma_{2\tau} = \mathrm{sort}(\alpha_1 \beta_1 \alpha_2 \beta_2 \cdots \alpha_\tau \beta_\tau).$$

Then, $m$ is squarefree since the monomial $y_{\alpha_1 \alpha_2 \cdots \alpha_\tau}^2$ is sorted. Since the
binomial $y_{\alpha_1 \alpha_2 \cdots \alpha_\tau} y_{\beta_1 \beta_2 \cdots \beta_\tau} - y_{\alpha'_1 \alpha'_2 \cdots \alpha'_\tau} y_{\beta'_1 \beta'_2 \cdots \beta'_\tau} \in K[Y]$ belongs to $I_{A_{\tau,b,c,r,s}}$
if and only if $\mathrm{sort}(\alpha_1 \alpha_2 \cdots \alpha_\tau \beta_1 \beta_2 \cdots \beta_\tau) = \mathrm{sort}(\alpha'_1 \alpha'_2 \cdots \alpha'_\tau \beta'_1 \beta'_2 \cdots \beta'_\tau)$, it is
sufficient to show that both $y_{\gamma_1 \gamma_3 \cdots \gamma_{2\tau-1}}$ and $y_{\gamma_2 \gamma_4 \cdots \gamma_{2\tau}}$ are variables of $K[Y]$.
For $1 \le i \le M$, let $\rho_i = |\{j \mid s_i \le \gamma_{2j-1} \le r_i\}|$ and $\sigma_i = |\{j \mid s_i \le \gamma_{2j} \le r_i\}|$.
Since $\gamma_1 \le \gamma_2 \le \cdots \le \gamma_{2\tau}$, $\rho_i$ and $\sigma_i$ are either equal or they differ by one for
each $i$. If $\rho_i \le \sigma_i$, then $0 \le \sigma_i - \rho_i \le 1$. Since $2c_i \le \rho_i + \sigma_i \le 2b_i$, we have
$\sigma_i \le b_i + 1/2$ and $c_i - 1/2 \le \rho_i$. Thus $c_i \le \rho_i \le \sigma_i \le b_i$. If $\rho_i > \sigma_i$, then
$\rho_i - \sigma_i = 1$. Since $2c_i \le \rho_i + \sigma_i \le 2b_i$, we have $\rho_i \le b_i + 1/2$ and $c_i - 1/2 \le \sigma_i$.
Thus $c_i \le \sigma_i < \rho_i \le b_i$. Hence $y_{\gamma_1 \gamma_3 \cdots \gamma_{2\tau-1}}$ and $y_{\gamma_2 \gamma_4 \cdots \gamma_{2\tau}}$ are variables of
$K[Y]$.

By virtue of the relation between the reduction of a monomial by $\mathcal{G}$ and sorting
of the indices of a monomial, it follows that there exists a monomial order such
that, for any binomial in $\mathcal{G}$, the first monomial is the initial monomial. See
also Theorem 3.12 in Sturmfels (1995).

Suppose that $\mathcal{G}$ is not a Gröbner basis. Thanks to Macaulay's Theorem,
there exists a binomial $f \in I_{A_{\tau,b,c,r,s}}$ such that both monomials in $f$ are sorted.
This means that $f = 0$ and $f$ is not a binomial. Hence $\mathcal{G}$ is a Gröbner basis of
$I_{A_{\tau,b,c,r,s}}$. It is easy to see that the Gröbner basis $\mathcal{G}$ is reduced and a minimal
set of generators of $I_{A_{\tau,b,c,r,s}}$. Q.E.D.

Finally we describe how to run a Markov chain using the Gröbner basis
given in Theorem 1. First, given a configuration A in (2), we check (with
appropriate reordering of rows) that A is indeed a configuration of
Segre-Veronese type. It is easy to check that our models in Sections 3.2 and 3.3
are of Segre-Veronese type, because the restrictions on choices are imposed
separately for each group or each subgroup. Recall that each column of A
consists of non-negative integers whose sum r is common.

We now associate to each column $a_i$ of A a set of indices indicating the
rows with positive elements $a_{ji} > 0$, where a particular index j is repeated $a_{ji}$
times. For example if d = 4, r = 3 and $a_i = (1,0,2,0)'$, then row 1 appears
once and row 3 appears twice in $a_i$. Therefore we associate the index (1, 3, 3)
to $a_i$. We can consider the set of indices as an r × ν matrix $\tilde{A}$. Note that A and
$\tilde{A}$ carry the same information.
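
The correspondence between a column of A and its index set can be sketched in a few lines of Python. This is only an illustration of the example above; the function names `column_to_index` and `index_to_column` are hypothetical and do not come from the paper.

```python
def column_to_index(column):
    """Map a column of non-negative integers to its sorted index tuple.

    Row j (1-based) is repeated column[j - 1] times.
    """
    return tuple(j for j, a in enumerate(column, start=1) for _ in range(a))


def index_to_column(index, d):
    """Inverse map: rebuild the length-d column from an index tuple."""
    column = [0] * d
    for j in index:
        column[j - 1] += 1
    return column


example_index = column_to_index([1, 0, 2, 0])
example_column = index_to_column((1, 3, 3), d=4)
```

Since the two maps are inverse to each other, either representation can be stored and the other recovered when needed.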

Given the index matrix $\tilde{A}$, we can choose a random element of the reduced Gröbner basis of
Theorem 1 as follows. Choose two columns of $\tilde{A}$ (i.e. choose two cells from $\mathcal{I}$)
and sort the 2 × r elements of these two columns. From the sorted elements, pick
alternate elements and form two new sets of indices. For example if r = 3 and
the two chosen columns of $\tilde{A}$ are (1, 3, 3) and (1, 2, 4), then by sorting these 6
elements we obtain (1, 1, 2, 3, 3, 4). Picking alternate elements produces (1, 2, 3)
and (1, 3, 4). These new sets of indices correspond to two (possibly overlapping)
columns of $\tilde{A}$, hence to two cells of $\mathcal{I}$. Now the difference of the two original
columns and the two sorted columns of A corresponds to a random binomial in
(9). It should be noted that when the sorted columns coincide with the original
columns, we discard these columns and choose two other columns. The
rest of the procedure for running a Markov chain is described in Diaconis and
Sturmfels (1998). See also Aoki and Takemura (2006).
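
The sort-and-alternate step described above can be sketched as follows. This is a minimal illustration, assuming columns are stored as index tuples; the names `sorted_pair` and `propose_move` are hypothetical, and the sketch does not include the acceptance step of the Markov chain or a check that the configuration is of Segre-Veronese type.

```python
def sorted_pair(index_a, index_b):
    """Merge two index tuples, sort them, and split by alternate picking."""
    merged = sorted(tuple(index_a) + tuple(index_b))
    return tuple(merged[0::2]), tuple(merged[1::2])


def propose_move(index_a, index_b):
    """Return the move (cells to decrease, cells to increase), or None.

    None signals that the sorted pair coincides with the original pair,
    in which case the pair is discarded and two other columns are chosen.
    """
    new_a, new_b = sorted_pair(index_a, index_b)
    if sorted([tuple(index_a), tuple(index_b)]) == sorted([new_a, new_b]):
        return None
    return (tuple(index_a), tuple(index_b)), (new_a, new_b)


move = propose_move((1, 3, 3), (1, 2, 4))
```

For the example in the text, `move` pairs the original columns (1, 3, 3) and (1, 2, 4) with the sorted columns (1, 2, 3) and (1, 3, 4).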

5 Numerical examples 

In this section we present numerical experiments on NCT data and a diplotype 
frequency data. 

5.1 The analysis of NCT data 

First we consider the analysis of NCT data concerning selections in Social
Studies and Science. Because NCUEE currently does not provide cross tabulations
of frequencies of choices across the major subjects, we cannot evaluate
the P-value of the actual data. However, for the models in Section 3.2 the
sufficient statistics (the marginal frequencies) can be obtained from Tables 8-12.
Therefore in this section we evaluate the conditional null distribution of
the Pearson's $\chi^2$ statistic by MCMC and compare it to the asymptotic $\chi^2$
distribution.

In Section 3.2 we consider three models, the complete independence model,
the subgroup-wise independence model and the group-wise independence model, for
the setting of group-wise selection problems. Note, however, that the
subgroup-wise independence model coincides with the group-wise independence model
for NCT data, since $c_{jk} \le 1$ for all j and k. Therefore we consider fitting of
the complete independence model and the group-wise independence model for
NCT data.

As we have seen in Section 2.1, there are many kinds of choices for each
examinee. However, it may be natural to treat some similar subjects as one
subject. For example, WHA and WHB may well be treated as WH, ChemI
and ChemIA may well be treated as Chem, and so on. As a result, we consider
the following aggregation of subjects.

- In Social Studies: WH = {WHA, WHB}, JH = {JHA, JHB}, Geo = {GeoA, GeoB}

- In Science: CSiB = {CSiB, ISci}, Bio = {BioI, BioIA}, Chem = {ChemI,
ChemIA}, Phys = {PhysI, PhysIA}, Earth = {EarthI, EarthIA}

In our analysis, we take a look at examinees selecting two subjects for Social
Studies and two subjects for Science. Therefore

$$J = 2,\quad m_1 = 2,\quad m_2 = 3,\quad m_{11} = m_{12} = 3,\quad m_{21} = m_{22} = m_{23} = 2,$$
$$c_{11} = c_{12} = 1,\quad (c_{21}, c_{22}, c_{23}) = (1, 1, 0) \text{ or } (1, 0, 1) \text{ or } (0, 1, 1).$$

The number of possible combinations is then $\nu = |\mathcal{I}| = 3 \cdot 3 \times 3 \cdot 2^2 = 108$.
Accordingly our sample size is n = 195094, which is the number of examinees
selecting two subjects on Science from Table 10. Our data set is shown in
Table 1.
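
The count of 108 combinations can be checked by direct enumeration. The sketch below assumes the aggregated subject names listed above and the grouping of Science subjects into the three subgroups {CSiB, Bio}, {CSiA, Chem} and {Phys, Earth}; the variable names are illustrative only.

```python
from itertools import combinations, product

# Social Studies: one subject from each of the two subgroups.
social_subgroups = [["WH", "JH", "Geo"], ["ContS", "Ethics", "P&E"]]
# Science: one subject from each of two of the three subgroups.
science_subgroups = [["CSiB", "Bio"], ["CSiA", "Chem"], ["Phys", "Earth"]]

social_choices = list(product(*social_subgroups))
science_choices = [
    pair
    for group_a, group_b in combinations(science_subgroups, 2)
    for pair in product(group_a, group_b)
]
cells = list(product(social_choices, science_choices))
```

Here `social_choices` has 3 · 3 = 9 elements, `science_choices` has 3 · 2² = 12 elements, and `cells` has 108 elements.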



Table 1 The data set of number of the examinees in NCT in 2006 (n = 195094)

            ContS    Ethics       P&E
WH          32352      8839      8338
JH          51573      8684     14499
Geo         59588      4046      7175

             CSiA      Chem      Phys     Earth
CSiB         1648      1572       169      4012
Bio         21392     55583      1416      1845
Phys         3286    102856
Earth         522       793







From Table 1 we can calculate the maximum likelihood estimates of the
numbers of the examinees selecting each combination of subjects. The sufficient
statistics under the complete independence model are the numbers of
the examinees selecting each subject, whereas the sufficient statistics under
the group-wise independence model are the numbers of the examinees selecting
each combination of subjects in the same group. The maximum likelihood
estimates calculated from the sufficient statistics are shown in Table 2. For
the complete independence model the maximum likelihood estimates can be
calculated as in Section 5.2 of Bishop et al. (1975).
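
For the group-wise independence model the estimates have a simple closed form: the expected count of a cell is the product of its Social Studies margin and its Science margin, divided by n. The following sketch uses the counts of Table 1 and reproduces two entries of Table 2 under that assumption; the dictionary and function names are hypothetical, and the complete independence model is not covered here.

```python
# Margins from Table 1 (n = 195094 examinees).
social = {
    ("WH", "ContS"): 32352, ("WH", "Ethics"): 8839, ("WH", "P&E"): 8338,
    ("JH", "ContS"): 51573, ("JH", "Ethics"): 8684, ("JH", "P&E"): 14499,
    ("Geo", "ContS"): 59588, ("Geo", "Ethics"): 4046, ("Geo", "P&E"): 7175,
}
science = {
    ("CSiB", "CSiA"): 1648, ("CSiB", "Chem"): 1572,
    ("CSiB", "Phys"): 169, ("CSiB", "Earth"): 4012,
    ("Bio", "CSiA"): 21392, ("Bio", "Chem"): 55583,
    ("Bio", "Phys"): 1416, ("Bio", "Earth"): 1845,
    ("CSiA", "Phys"): 3286, ("CSiA", "Earth"): 522,
    ("Chem", "Phys"): 102856, ("Chem", "Earth"): 793,
}
n = sum(social.values())


def groupwise_mle(social_cell, science_cell):
    """Expected count under independence of the two groups."""
    return social[social_cell] * science[science_cell] / n


first_entry = groupwise_mle(("WH", "ContS"), ("CSiB", "CSiA"))
large_entry = groupwise_mle(("Geo", "ContS"), ("Chem", "Phys"))
```

Rounded to two decimals, `first_entry` and `large_entry` agree with the corresponding lower entries 273.28 and 31415.54 of Table 2.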



Table 2 MLE of the number of the examinees selecting each combination of subjects under
the complete independence model (upper) and the group-wise independence model (lower).

                              WH                            JH                           Geo
                 ContS    Ethics       P&E     ContS    Ethics       P&E     ContS    Ethics       P&E
CSiB,CSiA       180.96     27.20     37.84    273.12     41.05     57.12    258.70     38.88     54.10
                273.28     74.66     70.43    435.65     73.36    122.48    503.35     34.18     60.61
CSiB,Chem      1083.82    162.89    226.65   1635.85    245.86    342.10   1549.48    232.88    324.03
                260.68     71.22     67.18    415.56     69.97    116.83    480.14     32.60     57.81
CSiB,Phys       110.04     16.54     23.01    166.09     24.96     34.73    157.32     23.64     32.90
                 28.02      7.66      7.22     44.68      7.52     12.56     51.62      3.50      6.22
CSiB,Earth        7.33      1.10      1.53     11.06      1.66      2.31     10.47      1.57      2.19
                665.30    181.77    171.47   1060.57    178.58    298.16   1225.39     83.20    147.55
Bio,CSiA       1961.78    294.84    410.26   2960.99    445.02    619.21   2804.66    421.52    586.52
               3547.39    969.19    914.26   5654.96    952.20   1589.81   6533.81    443.64    786.74
Bio,Chem      11749.94   1765.93   2457.19  17734.63   2665.39   3708.74  16798.27   2524.66   3512.92
               9217.20   2518.26   2375.53  14693.34   2474.10   4130.82  16976.84   1152.72   2044.18
Bio,Phys       1193.01    179.30    249.49   1800.65    270.63    376.56   1705.58    256.34    356.68
                234.81     64.15     60.52    374.32     63.03    105.23    432.49     29.37     52.08
Bio,Earth        79.43     11.94     16.61    119.88     18.02     25.07    113.55     17.07     23.75
                305.95     83.59     78.85    487.72     82.12    137.12    563.52     38.26     67.85
CSiA,Phys      2691.94    404.58    562.95   4063.04    610.65    849.68   3848.52    578.41    804.82
                544.91    148.88    140.44    868.65    146.27    244.21   1003.65     68.15    120.85
CSiA,Earth      179.22     26.94     37.48    270.50     40.65     56.57    256.22     38.51     53.58
                 86.56     23.65     22.31    137.99     23.24     38.79    159.44     10.83     19.20
Chem,Phys     16123.14   2423.20   3371.73  24335.27   3657.42   5089.09  23050.40   3464.31   4820.39
              17056.38   4660.03   4395.90  27189.93   4578.31   7644.05  31415.54   2133.10   3782.75
Chem,Earth     1073.41    161.33    224.48   1620.14    243.50    338.81   1534.60    230.64    320.92
                131.50     35.93     33.89    209.63     35.30     58.93    242.21     16.45     29.16



The configuration A for the complete independence model is written as

$$A = \begin{pmatrix} E_3 \otimes 1_3' \otimes 1_{12}' \\ 1_3' \otimes E_3 \otimes 1_{12}' \\ 1_9' \otimes B \end{pmatrix}$$

and the configuration A for the group-wise independence model is written as

$$A = \begin{pmatrix} E_9 \otimes 1_{12}' \\ 1_9' \otimes E_{12} \end{pmatrix},$$

where $E_n$ is the n × n identity matrix, $1_n = (1, \ldots, 1)'$ is the n × 1 column
vector of 1's, $\otimes$ denotes the Kronecker product and

B =
111100000000
000011110000
100010001100
010001000011
001000101010
000100010101

Note that the configuration B is the vertex-edge incidence matrix of the
(2, 2, 2) complete multipartite graph. Quadratic Gröbner bases of toric ideals
arising from complete multipartite graphs are studied in Ohsugi and Hibi
(2000).

Given these configurations we can easily run a Markov chain as discussed
at the end of Section 4. After 5,000,000 burn-in steps, we construct 10,000
Monte Carlo samples. Figure 1 shows histograms of the Monte Carlo sampling
generated from the exact conditional distribution of the Pearson goodness-of-fit
$\chi^2$ statistics for the NCT data under the complete independence model and
the group-wise independence model, respectively, along with the corresponding
asymptotic distributions $\chi^2_{98}$ and $\chi^2_{88}$.
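
The matrix B given earlier in this section can be regenerated from its description as a vertex-edge incidence matrix. The sketch below assumes that the vertices are the six aggregated Science subjects, grouped into the three subgroups, and that edges are listed vertex by vertex; this ordering is an assumption chosen to match the printed rows.

```python
parts = [["CSiB", "Bio"], ["CSiA", "Chem"], ["Phys", "Earth"]]
vertices = [v for part in parts for v in part]
part_of = {v: k for k, part in enumerate(parts) for v in part}

# Edges join vertices from different parts (complete multipartite graph).
edges = [
    (u, v)
    for i, u in enumerate(vertices)
    for v in vertices[i + 1:]
    if part_of[u] != part_of[v]
]

# Row w of B has a 1 in every column whose edge contains vertex w.
B = [[1 if w in edge else 0 for edge in edges] for w in vertices]
rows = ["".join(str(x) for x in row) for row in B]
```

Each column of B then has exactly two ones, one for each of the two Science subjects selected.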



Fig. 1 Asymptotic and Monte Carlo sampling distributions of NCT data. Panels: complete
independence model (df = 98) and group-wise independence model (df = 88).



5.2 The analysis of PTGDR (prostanoid DP receptor) diplotype frequencies 
data 

Next we give a numerical example of genome data. Table 3 shows diplotype
frequencies on the three loci, T-549C (locus 1), C-441T (locus 2) and T-197C
(locus 3) in the human genome 14q22.1, which is given in Oguma et al. (2004).
Though the data are used for the genetic association studies in Oguma et al.
(2004), we simply consider fitting our models. As an example, we only consider
the diplotype data of patients in the population of blacks (n = 79).

First we consider the analysis of genotype frequency data. Though Table 3
is diplotype frequency data, here we ignore the information on the haplotypes
and simply treat it as genotype frequency data. Since J = 3 and
$m_1 = m_2 = m_3 = 2$, there are $3^3 = 27$ distinct sets of genotypes, i.e., $|\mathcal{I}| = 27$,
while only 8 distinct haplotypes appear in Table 3. Table 4 is the set of genotype
frequencies of patients in the population of blacks. Under the genotype-wise
independence model (6), the sufficient statistic is the genotype frequency
data for each locus. On the other hand, under the Hardy-Weinberg model
(7), the sufficient statistic is the allele frequency data for each locus, and the
genotype frequencies for each locus are estimated by the Hardy-Weinberg law.
Accordingly, the maximum likelihood estimates for the combination of the
genotype frequencies are calculated as Table 5.

Table 3 PTGDR diplotype frequencies among patients and controls in each population.
(The order of the SNPs in the haplotype is T-549C, C-441T and T-197C.)

                    Whites                Blacks
Diplotype     Controls  Patients    Controls  Patients
CCT/CCT           16        78           7        10
CCT/TTT           27       106          12        27
CCT/TCT           48        93           4        12
CCT/CCC           17        45           3         9
TTT/TTT            9        43           2         7
TTT/TCT           34        60           8         6
TTT/CCC            4        28           1         6
TCT/TCT           11        20           7         0
TCT/CCC            6        35           1         2
CCC/CCC            1         8           0         0

Table 4 The genotype frequencies for patients among blacks of PTGDR data

locus 3               CC                  CT                  TT
locus 2          CC   CT   TT        CC   CT   TT        CC   CT   TT
locus 1   CC      0    0    0         9    0    0        10    0    0
          CT      0    0    0         2    6    0        12   27    0
          TT      0    0    0         0    0    0         0    6    7

Table 5 MLE for PTGDR genotype frequencies of patients among blacks under the
Hardy-Weinberg model (upper) and genotype-wise independence model (lower)

locus 3                  CC                        CT                        TT
locus 2           CC      CT      TT        CC      CT      TT        CC      CT      TT
locus 1   CC  0.1169  0.1180  0.0298     1.939   1.958  0.4941     8.042   8.118   2.049
                   0       0       0     1.708   2.018  0.3623     6.229   7.361   1.321
          CT  0.2008  0.2027  0.0512     3.331   3.362  0.8486     13.81   13.94   3.519
                   0       0       0     4.225   4.993  0.8962     15.41   18.21   3.268
          TT  0.0862  0.0870  0.0220     1.430   1.444  0.3644     5.931   5.988   1.511
                   0       0       0     1.169   1.381  0.2479     4.262   5.037  0.9040

The configuration A for the Hardy-Weinberg model is written as



A = 



222222222 111111111 000000000 
000000000 111111111 222222222 
222111000 222111000 222111000 
000111222 000111222 000111222 
210210210 210210210 210210210 
012012012 012012012 012012012 



and the configuration A for the genotype-wise independence model is written as

$$A = \begin{pmatrix} E_3 \otimes 1_3' \otimes 1_3' \\ 1_3' \otimes E_3 \otimes 1_3' \\ 1_3' \otimes 1_3' \otimes E_3 \end{pmatrix}.$$

Since these two configurations are of the Segre-Veronese type, again we can
easily perform MCMC sampling as discussed in Section 4. After 100,000 burn-in
steps, we construct 10,000 Monte Carlo samples. Figure 2 shows histograms
of the Monte Carlo sampling generated from the exact conditional distribution
of the Pearson goodness-of-fit $\chi^2$ statistics for the PTGDR genotype frequency
data under the Hardy-Weinberg model and the genotype-wise independence
model, respectively, along with the corresponding asymptotic distributions $\chi^2_{24}$
and $\chi^2_{21}$.
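
As a cross-check on the maximum likelihood estimates of the genotype combinations, the values can be recomputed from the genotype counts of the patients among blacks. The sketch below is an illustration under stated assumptions: cells are keyed by (locus 1, locus 2, locus 3) genotypes, the counts are those implied by the diplotype table, and all names are hypothetical.

```python
from itertools import product

GENOTYPES = ("CC", "CT", "TT")
# Observed counts, keyed by (locus 1, locus 2, locus 3); other cells are 0.
counts = {
    ("CC", "CC", "TT"): 10, ("CT", "CT", "TT"): 27, ("CT", "CC", "TT"): 12,
    ("CC", "CC", "CT"): 9, ("TT", "TT", "TT"): 7, ("TT", "CT", "TT"): 6,
    ("CT", "CT", "CT"): 6, ("CT", "CC", "CT"): 2,
}
n = sum(counts.values())


def genotype_margin(locus):
    """Genotype counts at one locus (sufficient statistic, genotype-wise model)."""
    margin = dict.fromkeys(GENOTYPES, 0)
    for cell, x in counts.items():
        margin[cell[locus]] += x
    return margin


def hardy_weinberg(margin):
    """Genotype probabilities from the allele frequency of C at one locus."""
    p = (2 * margin["CC"] + margin["CT"]) / (2 * n)
    return {"CC": p * p, "CT": 2 * p * (1 - p), "TT": (1 - p) ** 2}


margins = [genotype_margin(k) for k in range(3)]
hw = [hardy_weinberg(m) for m in margins]
cells = list(product(GENOTYPES, repeat=3))
mle_hw = {c: n * hw[0][c[0]] * hw[1][c[1]] * hw[2][c[2]] for c in cells}
mle_gw = {
    c: margins[0][c[0]] * margins[1][c[1]] * margins[2][c[2]] / n ** 2
    for c in cells
}
```

For instance, `mle_hw[("CC", "CC", "CC")]` is about 0.1169 and `mle_gw[("CC", "CC", "CT")]` is about 1.708, in agreement with the tabulated estimates.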




Fig. 2 Asymptotic and Monte Carlo sampling distributions of PTGDR genotype frequency
data. Panels: Hardy-Weinberg model (df = 24) and genotype-wise independence model (df = 21).



From the Monte Carlo samples, we can also estimate the P-values for
each null model. The values of the Pearson goodness-of-fit $\chi^2$ for the PTGDR
genotype frequency data of Table 4 are $\chi^2 = 88.26$ under the Hardy-Weinberg
model, whereas $\chi^2 = 103.37$ under the genotype-wise independence model.
These values are highly significant (p < 0.01 for both models), which implies
the susceptibility of the particular haplotypes.

Next we consider the analysis of the diplotype frequency data. In this case
of J = 3 and $m_1 = m_2 = m_3 = 2$, there are $2^3 = 8$ distinct haplotypes, and
there are

$$|\mathcal{I}| = 8 + \binom{8}{2} = 36$$

distinct diplotypes, while only 4 haplotypes and 10 diplotypes appear
in Table 3. The numbers of each haplotype are calculated as the second column
of Table 6. Under the Hardy-Weinberg model, the haplotype frequencies are
estimated proportionally to the allele frequencies, which is shown as the third
column of Table 6. The maximum likelihood estimates of the diplotype
frequencies under the Hardy-Weinberg model are calculated from the maximum
likelihood estimates for each haplotype. These values coincide with appropriate
fractions of the values for the corresponding combination of the genotypes in
Table 5. For example, the MLE for the diplotype CCT/CCT coincides with the
MLE for the combination of the genotypes (CC,CC,TT) in Table 5, whereas
the MLE's for the diplotypes CCC/TTT, CCT/TTC, CTC/TCT, CTT/TCC
coincide with the 1/4 fraction of the MLE for the combination of the genotypes
(CT,CT,CT), and so on. Since we know that the Hardy-Weinberg model is
highly statistically rejected, it is natural to consider the haplotype-wise
Hardy-Weinberg model given in Section 3.3.2. Table 7 shows the maximum likelihood
estimates under the haplotype-wise Hardy-Weinberg model. It should be noted
that the MLE for the other diplotypes are all zeros.

Table 6 Observed frequency and MLE under the Hardy-Weinberg model for PTGDR
haplotype frequencies of patients among blacks.

Haplotype   observed   MLE under HW      Haplotype   observed   MLE under HW
CCC             17          6.078        TCC              0          5.220
CCT             68         50.410        TCT             20         43.293
CTC              0          3.068        TTC              0          2.635
CTT              0         25.445        TTT             53         21.853

Table 7 MLE for PTGDR diplotype frequencies of patients among blacks under the
haplotype-wise Hardy-Weinberg model.

Diplotype   observed       MLE      Diplotype   observed       MLE
CCT/CCT         10     14.6329      TTT/TCT          6      6.7089
CCT/TTT         27     22.8101      TTT/CCC          6      5.7025
CCT/TCT         12      8.6076      TCT/TCT          0      1.2658
CCT/CCC          9      7.3165      TCT/CCC          2      2.1519
TTT/TTT          7      8.8892      CCC/CCC          0      0.9146

We perform the Markov chain Monte Carlo sampling for the haplotype-wise
Hardy-Weinberg model. The configuration A for this model is written as

A =
200000001111111000000000000000000000
020000001000000111111000000000000000
002000000100000100000111110000000000
000200000010000010000100001111000000
000020000001000001000010001000111000
000002000000100000100001000100100110
000000200000010000010000100010010101
000000020000001000001000010001001011

which is obviously of the Segre-Veronese type. We give a histogram of the
Monte Carlo sampling generated from the exact conditional distribution of
the Pearson goodness-of-fit $\chi^2$ statistics for the PTGDR diplotype frequency
data under the haplotype-wise Hardy-Weinberg model, along with the
corresponding asymptotic distribution $\chi^2_9$, in Figure 3.
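
The 8 × 36 configuration for the haplotype-wise Hardy-Weinberg model can be generated mechanically. In the sketch below the columns are assumed to be ordered as in the printed matrix, with the 8 homozygous diplotypes first and then the 28 heterozygous pairs in lexicographic order; the function name is hypothetical.

```python
from itertools import combinations


def diplotype_configuration(n_haplotypes):
    """Rows are haplotypes, columns are diplotypes (unordered pairs).

    Entry (h, pair) counts how many copies of haplotype h the pair contains.
    """
    homozygous = [(h, h) for h in range(n_haplotypes)]
    heterozygous = list(combinations(range(n_haplotypes), 2))
    columns = homozygous + heterozygous
    return [[pair.count(h) for pair in columns] for h in range(n_haplotypes)]


A = diplotype_configuration(8)
rows = ["".join(str(x) for x in row) for row in A]
```

Every column sums to 2, since each diplotype consists of two haplotypes.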

The P-value for this model is estimated as 0.8927 with the estimated standard
deviation 0.0029. (We also discard the first 100,000 samples, and use a
batching method to obtain an estimate of variance; see Hastings (1970) and
Ripley (1987).) Note that the asymptotic P-value based on $\chi^2_9$ is 0.6741.

Fig. 3 Asymptotic and Monte Carlo sampling distributions of PTGDR diplotype frequency
data under the haplotype-wise Hardy-Weinberg model (df = 9).
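
A batching estimate of the kind mentioned for the P-value can be sketched as follows. This is a generic batch-means computation, not necessarily the exact procedure used for the reported figures: the number of batches is an assumption, and `indicators` is a hypothetical list with entry 1 when a sampled $\chi^2$ value is at least the observed one and 0 otherwise.

```python
from math import sqrt
from statistics import mean, stdev


def batch_means(indicators, n_batches):
    """Estimate a P-value and its standard error from correlated samples.

    The chain is cut into n_batches consecutive blocks of equal length;
    the spread of the block means gives the standard error.
    """
    size = len(indicators) // n_batches
    block_means = [
        mean(indicators[k * size:(k + 1) * size]) for k in range(n_batches)
    ]
    return mean(block_means), stdev(block_means) / sqrt(n_batches)


# Toy input only; real use would pass the indicators from the MCMC run.
p_hat, std_err = batch_means([1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0], n_batches=3)
```

Batching is used because successive states of the chain are correlated, so the naive binomial standard error would tend to understate the variability.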



6 Some discussions 



In this paper we considered independence models in group-wise selections, 
which can be described in terms of a Segre- Veronese configuration. We have 
shown that our framework can be applied to two important examples in ed- 
ucational statistics and biostatistics. We expect that the methodology of the 
present paper finds applications in many other fields. 

In the NCT example, we assumed that the examinees choose the same num- 
ber r of subjects. We also assumed for simplicity that the examinees choose 
either nothing or one subject from a subgroup. This restricts our analysis to 
some subset of the examinees of NCT. Actually the examinees make decisions 
on how many subjects to take and modeling this decision making is clearly 
of statistical interest. Further complication arises from the fact that the ex- 
aminees can choose which scores to submit to universities after taking NCT. 
For example after obtaining scores of three subjects on Science, an examinee 
can choose the best two scores for submitting to a university. In our subse- 
quent paper (Aoki et al., 2007) we present a generalization of Segre- Veronese 
configurations to cope with these complications. 

It seems that the simplicity of the reduced Gröbner basis for the
Segre-Veronese configuration comes from the fact that the index set J of the rows
of A can be ordered and the restriction on the counts can be expressed in
terms of one-dimensional intervals. From a statistical viewpoint, ordering of the
elements of the sufficient statistic in group-wise selection seems to be somewhat
artificial. It is of interest to look for other statistical models, where ordering of
the elements of the sufficient statistic is more natural and the Segre-Veronese
configuration can be applied.

A Tables of numbers of examinees in NCT in 2006

Table 8 Number of examinees who take subjects on Social Studies

                        Geography and History                         Civics              # total    # actual
              WHA     WHB     JHA      JHB    GeoA     GeoB     ContS  Ethics     P&E   examinees   examinees
1 subject     496  29,108   1,456   54,577   1,347   27,152    40,677  16,607  25,321     196,741     196,741
2 subjects  1,028  61,132   3,386   90,427   5,039   83,828   180,108  27,064  37,668     489,680     244,840
Total       1,524  90,240   4,842  145,004   6,386  110,980   220,785  43,671  62,989     686,421     441,581



Table 9 Number of examinees who select two subjects on Social Studies

                      Geography and History
Civics        WHA     WHB     JHA     JHB    GeoA    GeoB     Total
ContSoc       687  39,913   2,277  62,448   3,817  70,966   180,108
Ethics        130  10,966     409  10,482     405   4,672    27,064
P&E           211  10,253     700  17,497     817   8,190    37,668
Total       1,028  61,132   3,386  90,427   5,039  83,828   244,840



Table 10 Number of examinees who take subjects on Science

                       Science 1                      Science 2                      Science 3               # total    # actual
             CSciB     BioI   ISci  BioIA     CSciA    ChemI  ChemIA     PhysI  EarthI  PhysIA  EarthIA   examinees   examinees
1 subject    2,558   80,385    511  1,314     1,569   19,616     717    14,397  10,788     289      236     132,380     132,380
2 subjects   6,878   79,041    523  1,195    26,848  158,027   2,777   106,822   6,913     905      259     390,188     195,094
3 subjects   7,942   18,519    728    490     6,838   20,404     437    18,451   8,423     361      444      83,037      27,679
Total       17,378  177,945  1,762  2,999    35,255  198,047   3,931   139,670  26,124   1,555      939     605,605     355,153



Table 11 Number of examinees who select two subjects on Science

                            Science 2                        Science 3
                     CSciA    ChemI  ChemIA     PhysI  EarthI  PhysIA  EarthIA
Science 1  CSciB     1,501    1,334      23       120   3,855       1       44
           BioI     21,264   54,412     244     1,366   1,698       5       52
           ISci        147      165      50        43      92       5       21
           BioIA       128      212     715        16      33      29       62
Science 3  PhysI     3,243  101,100     934
           EarthI      485      730      20
           PhysIA       43       54     768
           EarthIA      37       20      23











Table 12 Number of examinees who select three subjects on Science

Science 3                 PhysI                   EarthI                  PhysIA                 EarthIA
Science 2          CSciA   ChemI  ChemIA    CSciA   ChemI  ChemIA    CSciA   ChemI  ChemIA    CSciA   ChemI  ChemIA
Science 1  CSciB   1,155   5,152      17    1,201     317       7       16       5      16       48       5       3
           BioI      553  10,901      31    3,386   3,342      16       30      35      19      130      56      20
           ISci       80     380      23       62      34       4       32      13      27       48      14      11
           BioIA       6     114      39       22      22      10       12       6     150       57       8      44



